Chapter 16 · Fault Tolerance

Error Handling & Fault-Tolerant Systems

Errors are not exceptional — they are inevitable. Database queries fail, external APIs time out, users send bad data, and business logic hits edge cases nobody predicted. This chapter builds the mindset and concrete toolkit for detecting, handling, and recovering from every class of backend error — before it silently costs you money or trust.

The Fault-Tolerant Mindset

"The question is not whether errors will happen — it is how you will handle them when they do."

Every backend engineer must internalise a simple truth: your system will fail. Not might. Will. The sources are everywhere:

Database queries will occasionally fail or time out.
External APIs (payments, email, auth) will go down.
Users will send malformed, missing, or malicious data.
Business logic will hit edge cases no one thought of during design.

A fault-tolerant system is not one that never breaks. It is one that breaks predictably, recovers gracefully, and tells you exactly what happened. Achieving that requires a deliberate mindset shift from "I'll handle errors later" to "I'll design for failure from day one."

The best error handling starts before the error happens. Proactive detection, health checks, and validation prevent most runtime surprises.

The Five Classes of Backend Errors

Backend errors can be grouped into five broad categories. Each has a different origin, detection strategy, and fix.

① Logic Errors

App runs but produces wrong results. Hardest to detect. Can silently drain money for weeks.

② Database Errors

Connection failures, deadlocks, constraint violations, malformed SQL. Can bring the whole app down.

③ External Service Errors

Third-party APIs (payment, email, auth) time out, rate-limit, or go offline. You have no control.

④ Input Validation Errors

Users send bad, missing, or out-of-range data. Easiest to handle — if your validation layer is robust.

⑤ Configuration Errors

Missing env vars, wrong credentials on deploy. Done right, these surface at startup, not at runtime.

Logic Errors — The Silent Killers

Logic errors are the most dangerous class because your application keeps running — it just does the wrong thing. No crash, no stack trace, no 500 response. Just quietly wrong results accumulating over time.

Classic Example

An e-commerce platform applies a discount twice due to a bug in the promotion engine. The result: negative shipping costs. The app runs perfectly. Every order ships. The company loses money on every transaction. This goes unnoticed for weeks because no monitoring alert fires on "negative shipping cost."

Common Root Causes

Misunderstood requirements — notes from a sprint meeting were ambiguous; you implemented what you thought was asked, not what was intended.
Incorrect algorithms — a complex discount or pricing formula has an off-by-one error or a wrong operator (* instead of +).
Unhandled edge cases — a user who has never purchased before triggers a "past-purchase-based" discount path that wasn't designed for zero-purchase users.

Logic errors involving money, permissions, or security can corrupt data and produce wrong business results for months without a single error log entry. They are only found through careful testing, monitoring business metrics, and code review.

Prevention Strategies

Write unit tests for every business rule, especially discount, pricing, and permission logic.
Add business metric monitoring (e.g., alert if average order value drops by >20% in one hour).
Use property-based testing (Go: gopter, Python: hypothesis) to auto-generate edge-case inputs.
Require peer review for all payment and auth-related code changes.
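As a minimal sketch of the first three strategies, the snippet below encodes the "shipping can never go negative" rule from the classic example as an invariant and hammers it with random discount stacks. It is a hand-rolled stand-in for a property-based test, not gopter itself; `Order`, `applyShippingDiscount`, `checkInvariants`, and `fuzzDiscounts` are hypothetical names.

```go
package main

import (
    "fmt"
    "math/rand"
)

// Order holds amounts in integer cents to avoid floating-point rounding.
type Order struct {
    SubtotalCents int64
    ShippingCents int64
}

// applyShippingDiscount subtracts a discount from shipping and clamps at zero,
// so stacked promotions can never produce a negative shipping cost.
func applyShippingDiscount(o Order, discountCents int64) Order {
    if discountCents < 0 {
        discountCents = 0 // a negative discount would be a surcharge; ignore it
    }
    o.ShippingCents -= discountCents
    if o.ShippingCents < 0 {
        o.ShippingCents = 0
    }
    return o
}

// checkInvariants returns an error when an order violates a business rule.
func checkInvariants(o Order) error {
    if o.ShippingCents < 0 {
        return fmt.Errorf("negative shipping cost: %d", o.ShippingCents)
    }
    if o.SubtotalCents < 0 {
        return fmt.Errorf("negative subtotal: %d", o.SubtotalCents)
    }
    return nil
}

// fuzzDiscounts applies random stacks of up to three discounts per order and
// reports the first invariant violation it finds.
func fuzzDiscounts(rounds int, seed int64) error {
    rng := rand.New(rand.NewSource(seed))
    for i := 0; i < rounds; i++ {
        o := Order{
            SubtotalCents: rng.Int63n(100_000),
            ShippingCents: rng.Int63n(5_000),
        }
        for n := rng.Intn(4); n > 0; n-- {
            o = applyShippingDiscount(o, rng.Int63n(6_000))
        }
        if err := checkInvariants(o); err != nil {
            return fmt.Errorf("round %d: %w", i, err)
        }
    }
    return nil
}
```

The same `checkInvariants` function can run in production just before an order is persisted, which turns a silent logic error into a loud, countable one.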

Database Errors

Most backend applications are meaningless without their database. A database error of any kind means your app cannot serve real data — which usually means a broken UI or cascading failures across services.

① Connection Errors

Your backend cannot reach the database server. Possible causes:

Network partition or DNS failure between app server and DB server.
Database server is overloaded or down.
Connection pool exhausted — all pooled TCP connections are in use; new requests queue up or fail immediately.

Connection pooling keeps a fixed number of open TCP connections to the database, avoiding the overhead of a full TCP handshake + TLS negotiation on every query. Tools: pgxpool (Go), psycopg2 pool or SQLAlchemy (Python), pg Pool (Node). Size your pool carefully: too small → bottleneck; too large → DB overload.

② Constraint Violation Errors

You are trying to perform an operation that violates a database-level rule:

Constraint Type	Trigger	Appropriate Response
Unique	Insert a duplicate email / username	HTTP 409 Conflict or 400 — "Email already in use"
Foreign Key	Reference a row that doesn't exist	HTTP 404 — "Author ID not found" / 400
Not Null	Missing required column value	HTTP 400 — "Field X is required"
Check	Value fails a custom rule (e.g. price > 0)	HTTP 400 — domain-specific message
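The table can be sketched as a lookup from Postgres SQLSTATE codes to HTTP responses. The four codes are the standard Postgres values for these constraint classes; the function name and the message wording are assumptions, and a real application would usually pick 404 or 400 per endpoint for foreign-key failures.

```go
package main

import "net/http"

// statusForSQLState maps a Postgres SQLSTATE code to an HTTP status
// and a message that is safe to show to the client.
func statusForSQLState(code string) (int, string) {
    switch code {
    case "23505": // unique_violation
        return http.StatusConflict, "resource already exists"
    case "23503": // foreign_key_violation
        return http.StatusNotFound, "referenced resource not found"
    case "23502": // not_null_violation
        return http.StatusBadRequest, "a required field is missing"
    case "23514": // check_violation
        return http.StatusBadRequest, "a field value is not allowed"
    default:
        // Unknown database errors must not leak details to the client.
        return http.StatusInternalServerError, "something went wrong"
    }
}
```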

③ Query / Syntax Errors

Malformed SQL — a table name typo, referencing a column that was renamed, or a missing join condition. These are usually caught in development but can slip through if raw SQL strings are built dynamically.

Never construct SQL by string concatenation with user input. Use parameterised queries / prepared statements always. This prevents both query errors and SQL injection attacks.
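A minimal sketch of a parameterised query with Go's standard `database/sql` package, assuming a Postgres driver (hence the `$1` placeholder) and a hypothetical `users` table with `id` and `email` columns. The key property is that the SQL text is a constant and user input only ever travels as an argument.

```go
package main

import (
    "context"
    "database/sql"
)

// userByEmailSQL never changes at runtime; $1 is filled in by the driver.
const userByEmailSQL = `SELECT id FROM users WHERE email = $1`

// userByEmailQuery returns the fixed SQL text and the arguments to bind.
func userByEmailQuery(email string) (string, []any) {
    return userByEmailSQL, []any{email}
}

// findUserID looks up a user by email. It returns sql.ErrNoRows when
// no such user exists.
func findUserID(ctx context.Context, db *sql.DB, email string) (int64, error) {
    query, args := userByEmailQuery(email)
    var id int64
    err := db.QueryRowContext(ctx, query, args...).Scan(&id)
    return id, err
}
```

Even an input such as `x' OR '1'='1` leaves the query text untouched, because the value is sent separately from the statement rather than spliced into it.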

④ Deadlocks

A deadlock occurs when two (or more) transactions each hold a lock that the other needs:

Fig 1 — Deadlock: two transactions waiting on each other's locks indefinitely. The DB detects and aborts one.

Postgres detects deadlocks automatically and kills one transaction with error code 40P01. Your application must retry that transaction. Prevention: always acquire locks in a consistent order across all code paths.

External Service Errors

Modern SaaS backends depend on a constellation of third-party services — payment processors (Stripe), email (Resend, SendGrid), object storage (S3), auth (Clerk, Auth0), AI (OpenAI). Every one of these is a point of failure outside your control.

Fig 2 — Each external dependency is an independent failure point your app must handle gracefully.

① Network Failures

The internet between your server and the external API is unreliable. You will encounter: connection timeouts, DNS resolution failures, network partitions, and TLS handshake errors. Set explicit timeouts on every outgoing HTTP call — never let a slow third-party API block your goroutine / thread indefinitely.

② Rate Limiting — HTTP 429

Every serious external API enforces rate limits to prevent abuse. If your app hammers an API (due to a bug, a traffic spike, or a loop error), you will receive HTTP 429 Too Many Requests.

The standard mitigation is Exponential Backoff with Jitter:

Fig 3 — Exponential Backoff: each retry waits 2× longer. Jitter spreads retries to avoid thundering herd.

③ Service Outage / Downtime

Major cloud providers (AWS, GCP) and popular SaaS services go down occasionally. Your app needs a strategy for when a critical dependency is completely unavailable:

Fallback — if Redis cache is down, fall back to direct DB reads for non-critical data.
Graceful degradation — disable the affected feature (e.g., "AI suggestions temporarily unavailable") rather than crashing the whole app.
Circuit breaker pattern — after N consecutive failures, stop sending requests to the broken service and return a cached/default response immediately. Re-try the service after a cool-down period.

Input Validation Errors

These are the easiest errors to handle because you define the rules. Your validation layer is the first line of defence: catch bad data at the entry point, before it reaches your database or business logic.

Types of Validation

Type	What It Checks	Example
Format	Shape/pattern of the value	Email regex, ISO date, E.164 phone
Range	Numeric bounds, string length, array size	Price: 0–99999, name: 2–100 chars
Required	Mandatory field present	user_id must not be null
Business Rule	Domain-specific constraint	Booking end_date > start_date
Referential	Related entity actually exists	category_id exists in categories table

Always validate at both layers: frontend (UX) and backend (security). Never trust client-side validation alone. The backend is the authoritative gate.

Return HTTP 400 Bad Request with a structured error body listing every field that failed and why. Don't return a single generic message — help the user fix all their mistakes in one round-trip.
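A minimal sketch of a validator that collects every failing field instead of stopping at the first one. `BookingRequest` and its fields are hypothetical, and the email check is deliberately loose; the returned map is what you would place in the `details` field of a 400 response.

```go
package main

import (
    "strings"
    "time"
    "unicode/utf8"
)

// BookingRequest is an example payload; dates are expected as YYYY-MM-DD.
type BookingRequest struct {
    Email     string
    Name      string
    StartDate string
    EndDate   string
}

// Validate returns a map of field name to problem. An empty map means valid.
func (r BookingRequest) Validate() map[string]string {
    errs := map[string]string{}

    // Required, then format. The "@" test is deliberately loose.
    switch {
    case r.Email == "":
        errs["email"] = "is required"
    case !strings.Contains(r.Email, "@"):
        errs["email"] = "must be a valid email address"
    }

    // Range: count characters, not bytes.
    if n := utf8.RuneCountInString(strings.TrimSpace(r.Name)); n < 2 || n > 100 {
        errs["name"] = "must be 2 to 100 characters"
    }

    // Format: ISO dates.
    start, errStart := time.Parse("2006-01-02", r.StartDate)
    if errStart != nil {
        errs["start_date"] = "must be an ISO date (YYYY-MM-DD)"
    }
    end, errEnd := time.Parse("2006-01-02", r.EndDate)
    if errEnd != nil {
        errs["end_date"] = "must be an ISO date (YYYY-MM-DD)"
    }

    // Business rule, checked only when both dates parsed.
    if errStart == nil && errEnd == nil && !end.After(start) {
        errs["end_date"] = "must be after start_date"
    }
    return errs
}
```

Referential checks (for example, whether a category ID exists) need a database lookup and usually live one layer further in, after this cheap validation has passed.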

Configuration Errors

Configuration errors happen at the boundary between environments — dev → staging → production. A missing OPENAI_API_KEY, a wrong database URL, or a forgotten secret can silently break specific features while the rest of the app appears healthy.

Fail Fast at Startup — Not at Runtime

The golden rule: validate all required environment variables before the server starts accepting traffic. If any are missing or corrupt, crash immediately with a clear error message.

❌ Bad — Runtime Failure

App starts successfully
First user hits the AI image endpoint
OpenAI call fails — key is missing
User gets a mysterious 500 error
Old deployment is already stopped
Site is down until manually fixed

✅ Good — Startup Failure

New deployment starts
Config validation runs immediately
Missing key detected → process exits with clear message
Blue-green: old deployment still running
Zero downtime — ops team fixes and redeploys

Blue-green deployments make startup-time crashes safe: the new version must pass health checks before the old version is terminated. If the new version crashes at start, the old version keeps serving traffic.

Go — Config Validation at Boot

package config

import (
    "fmt"
    "os"
    "strings"
)

type Config struct {
    DatabaseURL    string
    OpenAIKey      string
    JWTSecret      string
    ResendAPIKey   string
}

// MustLoad panics if any required variable is missing.
// Call this once in main() before http.ListenAndServe.
func MustLoad() Config {
    required := []string{
        "DATABASE_URL",
        "OPENAI_API_KEY",
        "JWT_SECRET",
        "RESEND_API_KEY",
    }

    var missing []string
    for _, key := range required {
        if os.Getenv(key) == "" {
            missing = append(missing, key)
        }
    }
    if len(missing) > 0 {
        // Crash immediately — loud and clear
        panic(fmt.Sprintf("[FATAL] missing required env vars: %s",
            strings.Join(missing, ", ")))
    }

    return Config{
        DatabaseURL:  os.Getenv("DATABASE_URL"),
        OpenAIKey:    os.Getenv("OPENAI_API_KEY"),
        JWTSecret:    os.Getenv("JWT_SECRET"),
        ResendAPIKey: os.Getenv("RESEND_API_KEY"),
    }
}

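// main.go (package main; imports "log" and the config package above)
func main() {
    cfg := config.MustLoad()  // panics here if config invalid
    server := newServer(cfg)  // newServer builds the *http.Server (defined elsewhere)
    log.Fatal(server.ListenAndServe())
}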

Proactive Error Detection — Health Checks

"The best error handling starts before the error happens."

Health checks continuously verify that your system is working — not just that it is running. There is a critical difference:

Fig 4 — Liveness vs Readiness: the industry-standard split (Kubernetes uses both).

What to Check

Database — run a lightweight representative query. Track query time; if it jumps from 50ms to 4s, something is wrong before users notice.
External services — payment processors: run periodic test transactions; email: send to an internal address; auth: generate and validate a test token.
Configuration — verify all required env vars are loaded and non-empty at startup.
Cache warmup — ensure critical caches (session store, product catalogue) are populated before serving traffic.

Go — Deep Health Check Endpoint

type HealthStatus struct {
    Status   string            `json:"status"`
    Checks   map[string]string `json:"checks"`
}

func healthHandler(db *pgxpool.Pool, rdb *redis.Client) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        checks := map[string]string{}
        overall := "ok"

        // DB check
        ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
        defer cancel()
        if err := db.Ping(ctx); err != nil {
            checks["database"] = "unhealthy: " + err.Error()
            overall = "degraded"
        } else {
            checks["database"] = "ok"
        }

        // Redis check (reuses the 2s timeout so a hung cache cannot stall the probe)
        if err := rdb.Ping(ctx).Err(); err != nil {
            checks["cache"] = "unhealthy: " + err.Error()
            overall = "degraded"
        } else {
            checks["cache"] = "ok"
        }

        status := http.StatusOK
        if overall != "ok" { status = http.StatusServiceUnavailable }

        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(status)
        json.NewEncoder(w).Encode(HealthStatus{Status: overall, Checks: checks})
    }
}

Monitoring & Observability

Health checks tell you something is broken right now. Monitoring tells you something is about to break — and gives you the context to understand why something broke after the fact.

What to Track

Category	Metrics to Monitor	Why
HTTP Layer	4xx / 5xx rate, p50/p95/p99 latency	Surface user-facing issues immediately
Database	Query duration, connection pool usage, deadlock count	Detect slow queries before timeout
External Services	Call success rate, latency, 429 count	Know when a dependency is degrading
Business Metrics	Successful transactions/min, failed payments, sign-up rate	Catch logic errors invisible to error rates
Infrastructure	CPU, memory, disk I/O, network throughput	Resource exhaustion precedes crashes

Don't only track error rates. A drop in successful transactions from 1000/min to 200/min is a critical problem — even if the error rate shows 0%. Always monitor positive business outcomes, not just failure signals.
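The check behind such an alert is simple arithmetic. A minimal sketch, assuming you already have a baseline and a current rate from your metrics system; the function name and the idea of a fractional `maxDrop` threshold are illustrative:

```go
package main

// throughputDropped reports whether the current rate fell more than maxDrop
// (0.2 means 20%) below the baseline. A non-positive baseline never alerts.
func throughputDropped(baselinePerMin, currentPerMin, maxDrop float64) bool {
    if baselinePerMin <= 0 {
        return false
    }
    drop := (baselinePerMin - currentPerMin) / baselinePerMin
    return drop > maxDrop
}
```

For the example above, `throughputDropped(1000, 200, 0.5)` is true because the rate fell by 80%, even though not a single request returned an error.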

Structured Logging (JSON)

Plain-text logs are hard to query at scale. Use structured JSON logs so log aggregation tools (Grafana Loki, Datadog, ELK) can parse, filter, and alert on them programmatically.

Python — structlog

import structlog

log = structlog.get_logger()

# Good — structured, queryable, no sensitive data
log.error(
    "payment_failed",
    user_id="u_9a3f",          # ID, not email
    correlation_id="req_abc123",
    provider="stripe",
    error_code="card_declined",
    amount_cents=4999,
)

# BAD — never log PII or secrets
# log.error("payment_failed", email="alice@example.com", card="4242...")

Recovery Strategies

Recoverable vs Non-Recoverable

Recoverable Errors

Transient network glitch to email API
Database connection pool temporarily exhausted
Rate limit 429 from external service

Strategy: Retry with exponential backoff. Queue the work. Don't give up immediately.

Non-Recoverable Errors

Redis cluster completely down
Payment processor offline for hours
Corrupt data in the DB

Strategy: Graceful degradation. Fallback. Disable the feature. Protect core functionality.

Exponential Backoff in Go

func sendEmailWithRetry(to, subject, body string) error {
    maxRetries := 5
    baseDelay  := 1 * time.Second

    for attempt := 0; attempt < maxRetries; attempt++ {
        err := emailClient.Send(to, subject, body)
        if err == nil {
            return nil  // success
        }

        if !isRetryable(err) {
            return fmt.Errorf("permanent failure: %w", err)
        }

        // Don't sleep after the final attempt; just report failure
        if attempt == maxRetries-1 {
            break
        }

        // Exponential backoff: 1s, 2s, 4s, 8s between the five attempts
        wait := baseDelay * time.Duration(1<<attempt)
        // Add up to +20% random jitter to prevent thundering herd
        jitter := time.Duration(rand.Int63n(int64(wait / 5)))

        slog.Warn("email send failed, retrying",
            "attempt", attempt+1,
            "wait_ms", (wait + jitter).Milliseconds(),
            "error", err)
        time.Sleep(wait + jitter)
    }
    return fmt.Errorf("all %d retries exhausted", maxRetries)
}

func isRetryable(err error) bool {
    // Retry on 429 and any 5xx; not on other 4xx (400, 401, 422, ...).
    // HTTPError is assumed to be the email client's error type.
    var httpErr *HTTPError
    if errors.As(err, &httpErr) {
        return httpErr.StatusCode == 429 || httpErr.StatusCode >= 500
    }
    return true  // anything else is treated as a transient network error
}

Automatic vs Manual Recovery

Automatic: restart crashed processes (systemd, Kubernetes restart policy), clean up corrupted caches, switch to backup systems. Design carefully — automatic recovery can sometimes amplify a problem.
Manual: data corruption, payment discrepancies, security incidents. These require human judgment. Document the runbook. Test it. Know who is on-call.

Data integrity is your #1 priority during any incident. Never auto-delete or auto-migrate data as part of an error recovery procedure. Take a backup first, always.

Global Error Handler — The Final Safety Net

The global error handler is a single middleware that sits at the outermost layer of your application, catches every error that bubbles up from any layer, and converts it into a properly formatted HTTP response.

Fig 5 — All errors bubble up to one middleware. One place to define every response format.

Two Major Advantages

No forgotten error conditions — every unhandled error falls through to the global handler's default case (500 + "something went wrong"). Nothing silently swallowed.
Zero redundancy — database error handling logic lives in one file, not scattered across 40 repository methods. Change the unique-violation message once, it applies everywhere.

Go — Global Error Handler Implementation

Go has no try/catch exceptions; errors are ordinary return values. The pattern is to return errors up the call stack and handle them in middleware.

Go — errors/types.go

package apperr

import "net/http"

// AppError is the canonical error type for this application.
type AppError struct {
    Code    int      // HTTP status code
    Message string   // Safe, user-facing message
    Details any      // Optional: field-level errors for 400s
    Err     error    // Original error — for logging only, NEVER sent to client
}

func (e *AppError) Error() string { return e.Message }

// Constructors
func NotFound(resource string) *AppError {
    return &AppError{Code: http.StatusNotFound, Message: resource + " not found"}
}
func Conflict(msg string) *AppError {
    return &AppError{Code: http.StatusConflict, Message: msg}
}
func BadRequest(msg string, details any) *AppError {
    return &AppError{Code: http.StatusBadRequest, Message: msg, Details: details}
}
func Internal(err error) *AppError {
    return &AppError{
        Code:    http.StatusInternalServerError,
        Message: "something went wrong",  // NEVER expose err.Error() here
        Err:     err,
    }
}

Go — middleware/error_handler.go

package middleware

import (
    "encoding/json"
    "errors"
    "log/slog"
    "net/http"

    "github.com/jackc/pgx/v5"        // needed for pgx.ErrNoRows below
    "github.com/jackc/pgx/v5/pgconn"
    apperr "yourapp/errors"
)

type ErrorResponse struct {
    Code    int    `json:"code"`
    Message string `json:"message"`
    Details any    `json:"details,omitempty"`
}

// GlobalErrorHandler wraps a handler that returns an error.
func GlobalErrorHandler(next func(http.ResponseWriter, *http.Request) error) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        err := next(w, r)
        if err == nil { return }

        var appErr *apperr.AppError

        switch {
        // Already wrapped as AppError
        case errors.As(err, &appErr):
            if appErr.Err != nil {
                slog.Error("app error", "err", appErr.Err)
            }

        // Postgres unique constraint violation → 409
        case isPgError(err, "23505"):
            appErr = apperr.Conflict("resource already exists")

        // Postgres foreign key violation → 404
        case isPgError(err, "23503"):
            appErr = apperr.NotFound("referenced resource")

        // pgx no-rows → 404
        case errors.Is(err, pgx.ErrNoRows):
            appErr = apperr.NotFound("resource")

        // Everything else → 500 (never leak internal error)
        default:
            slog.Error("unhandled error", "err", err)
            appErr = apperr.Internal(err)
        }

        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(appErr.Code)
        json.NewEncoder(w).Encode(ErrorResponse{
            Code:    appErr.Code,
            Message: appErr.Message,
            Details: appErr.Details,
        })
    }
}

func isPgError(err error, code string) bool {
    var pgErr *pgconn.PgError
    return errors.As(err, &pgErr) && pgErr.Code == code
}

Python — Global Error Handler (FastAPI)

Python — FastAPI

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from psycopg2 import errors as pg_errors
import logging

app = FastAPI()
logger = logging.getLogger("app")

# --- Custom exception types ---
class AppError(Exception):
    def __init__(self, status: int, message: str, details=None):
        self.status  = status
        self.message = message
        self.details = details

class NotFoundError(AppError):
    def __init__(self, resource: str):
        super().__init__(404, f"{resource} not found")

class ConflictError(AppError):
    def __init__(self, msg: str):
        super().__init__(409, msg)

# --- Global exception handlers ---
@app.exception_handler(AppError)
async def app_error_handler(request: Request, exc: AppError):
    return JSONResponse(
        status_code=exc.status,
        content={"code": exc.status, "message": exc.message, "details": exc.details}
    )

@app.exception_handler(pg_errors.UniqueViolation)
async def unique_violation_handler(request: Request, exc):
    logger.warning("unique_violation", extra={"path": request.url.path})
    return JSONResponse(status_code=409,
        content={"code": 409, "message": "resource already exists"})

@app.exception_handler(pg_errors.ForeignKeyViolation)
async def fk_violation_handler(request: Request, exc):
    return JSONResponse(status_code=404,
        content={"code": 404, "message": "referenced resource not found"})

@app.exception_handler(Exception)
async def unhandled_error_handler(request: Request, exc: Exception):
    # Log the real error internally, never expose it
    logger.error("unhandled_exception", exc_info=exc,
        extra={"path": request.url.path})
    return JSONResponse(status_code=500,
        content={"code": 500, "message": "something went wrong"})

Security — What to Expose, What to Hide

Every error message that leaves your backend is a potential information leak. Treat your error responses with the same care as your API responses.

① Never Leak Internal Details

Database error messages from Postgres contain table names, column names, index names, and constraint names. If you forward a raw pgconn.PgError message directly to the client, an attacker learns your schema and can craft more targeted SQL injection attempts.

What You Got	What to Send to Client
`duplicate key value violates unique constraint "users_email_key"`	`"Email already in use"`
`relation "usres" does not exist` (typo)	`"Something went wrong"`
`deadlock detected on relation 42816`	`"Something went wrong, please retry"`
`stack trace: panic at server.go:142`	`"Internal server error"`

② Vague Auth Errors (On Purpose)

Login endpoints are the most attacked surface in any application. If you return specific messages like "no user with this email exists" vs "password is incorrect", an attacker can enumerate valid emails through a simple loop.

Fig 6 — Specific auth error messages enable email enumeration attacks. Use a single generic message for all auth failures.
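A minimal sketch of a login check that returns one generic error for both failure paths. The `Authenticator` type and its function fields are hypothetical; `verify` stands in for a real password-hash comparison such as bcrypt or argon2, and `dummyHash` is assumed to be a valid hash of a random password.

```go
package main

import "errors"

// ErrInvalidCredentials is the only error a client ever sees from Login.
var ErrInvalidCredentials = errors.New("invalid email or password")

// Authenticator wires the login check to storage and to a hash verifier.
type Authenticator struct {
    // lookupHash returns the stored password hash for an email, if any.
    lookupHash func(email string) (hash string, found bool)
    // verify checks a password against a hash.
    verify func(hash, password string) bool
    // dummyHash is verified when the user is unknown, so that both
    // failure paths do comparable work.
    dummyHash string
}

// Login returns nil on success and ErrInvalidCredentials otherwise.
// It never reveals whether the email or the password was wrong.
func (a Authenticator) Login(email, password string) error {
    hash, found := a.lookupHash(email)
    if !found {
        hash = a.dummyHash
    }
    ok := a.verify(hash, password)
    if !found || !ok {
        return ErrInvalidCredentials
    }
    return nil
}
```

Running `verify` even for unknown users is a design choice: skipping it would make "no such user" responses measurably faster, which leaks the same information through timing.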

③ Safe Logging Practices

Logs are often shipped to third-party aggregation services (Datadog, Grafana Cloud, ELK). In major data breaches, leaked log files exposed millions of records — because engineers had carelessly logged sensitive fields.

Never log: passwords, API keys, credit card numbers, SSNs, full email addresses, session tokens.
Log instead: user ID (not email), correlation/request ID, operation name, error code.
Scrub logs automatically (e.g. Go: a custom slog handler or ReplaceAttr hook; Python: structlog processors) so known sensitive fields are redacted even when someone logs them by mistake.

Go — safe vs unsafe logging

// ❌ UNSAFE — never do this
slog.Error("login_failed",
    "email", user.Email,          // PII leak
    "password", req.Password,     // catastrophic
    "api_key", cfg.OpenAIKey,      // secret leak
)

// ✅ SAFE — IDs and correlation only
slog.Error("login_failed",
    "user_id", user.ID,
    "correlation_id", r.Header.Get("X-Request-ID"),
    "reason", "invalid_credentials",  // generic code, not DB message
)

Follow the OWASP Authentication Cheat Sheet for authentication endpoint security. It covers error message handling, brute-force protection, account lockout, and more.

References & Further Reading

OWASP Error Handling Cheat Sheet
OWASP Authentication Cheat Sheet
Go errors package
Go slog (structured logging)
FastAPI Error Handling
structlog (Python)
MDN: HTTP Status Codes
AWS: Backoff with Jitter
Grafana Loki

Backend Field Manual · Error Handling & Fault Tolerance · Chapter 16